Session 1: Introduction

ESS: Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2024-07-22

Introduction

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

General Goals:

  • I want to teach you web scraping and data management
  • I also want to give you the tools for reproducible and transparent open science research

The Plan for Today

In this session, you learn how to use the tools of the hunt. We will:

  • discuss some useful tools and learn about:
    • each other
    • the course material and structure
  • go over some principles of using the programming language R:
    • R Refresher
    • literate programming
  • wrap up with some tips and tricks

Woody Kelly via unsplash.com

Who am I?

  • PostDoc at Department of Language, Literature and Communication at Vrije Universiteit Amsterdam and University of Amsterdam
  • Interested in:
    • Computational Social Science
    • Automated Text Analysis
    • Hybrid Media Systems and Information Flows
    • Protest and Democracy
  • Experience:
    • R user since 2015
    • R package developer since 2017
    • Worked on several packages for text analysis, API access and web scraping (spacyr, quanteda.textmodels, LexisNexisTools, paperboy, traktok, rollama, amcat4-r, and more)

Who are you?

  • What is your name?
  • What are your research interests?
  • What is your experience with:
    • R
    • HTML
    • web scraping
    • data management
  • Why are you taking this course?
  • Do you have specific plans that include web scraping and data management?

How to use the course material

Prerequisites

You should have this software installed:

R version

You need a relatively recent version of R. The following command should report at least R version 4.1.0 (2021-05-18):

R.Version()$version.string
[1] "R version 4.4.1 (2024-06-14)"
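If you prefer a programmatic check over reading the version string, a minimal sketch with base R's getRversion() could look like this (the threshold 4.1.0 is taken from the text above; the object names are only illustrative):

```r
# Compare the running R version against the minimum needed for the course
min_version <- "4.1.0"
r_is_recent <- getRversion() >= min_version
if (!r_is_recent) {
  warning("Please update R: found ", getRversion(), ", need at least ", min_version)
}
r_is_recent
```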

Get the course material

Navigate here and clone the repository:

https://github.com/JBGruber/ess-v2-web-scraping-data-management

In RStudio go to “Create a project” (top left corner). Then select “Version Control”:

Or if you are using the command line, you can simply type:

git clone https://github.com/JBGruber/ess-v2-web-scraping-data-management.git
# OR
git clone git@github.com:JBGruber/ess-v2-web-scraping-data-management.git

How I will work

I tend to jump back and forth between the slides and RStudio

Slides

See the html file in each session folder

Source

See the qmd file in each session folder

Use the course material

  1. Pull the latest version at the beginning of each session
  2. Make a copy of the qmd file and name it, e.g., “1_Introduction_to_Computing_notes.qmd”
  3. Use this file to make notes, for example by adding comments using this syntax <!-- your comment --> (RStudio shortcut Ctrl + Shift + C / Command + Shift + C on macOS)
  • Alternatively, open the slides in a browser and press e to export them to PDF (and take notes with PDF reader)

Working in a copy makes sure you don’t get any git conflicts when you pull after I have updated something in the material in the meantime.

Your turn (Exercises 1)

  1. Download the course material and open the RStudio project or folder in your IDE
  2. Open the file readme.qmd from the file explorer of your IDE
  3. Execute the final code chunk to install all packages we will need for this course

R (re-)fresher

Packages

  • R organises its functions in packages (even base functions)
  • Most packages must be installed (once) and attached (every new session)
install.packages("tidyverse")
library(tidyverse)
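Since installing is only needed once, scripts often check whether a package is already available before installing it. A minimal sketch, assuming a made-up helper name needs_install() (it is not part of any package):

```r
# Returns TRUE if a package still needs to be installed
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

needs_install("stats") # FALSE: stats ships with every R installation

# Typical use at the top of a script:
# if (needs_install("tidyverse")) install.packages("tidyverse")
# library(tidyverse)
```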

Accessing Functions

If you do not want to attach an entire package, you can use the double colon operator (::) to call a specific function:

dplyr::select(iris, Sepal.Length)
    Sepal.Length
1            5.1
2            4.9
3            4.7
4            4.6
5            5.0
6            5.4
7            4.6
8            5.0
9            4.4
10           4.9
11           5.4
12           4.8
13           4.8
14           4.3
15           5.8
16           5.7
17           5.4
18           5.1
19           5.7
20           5.1
21           5.4
22           5.1
23           4.6
24           5.1
25           4.8
26           5.0
27           5.0
28           5.2
29           5.2
30           4.7
31           4.8
32           5.4
33           5.2
34           5.5
35           4.9
36           5.0
37           5.5
38           4.9
39           4.4
40           5.1
41           5.0
42           4.5
43           4.4
44           5.0
45           5.1
46           4.8
47           5.1
48           4.6
49           5.3
50           5.0
51           7.0
52           6.4
53           6.9
54           5.5
55           6.5
56           5.7
57           6.3
58           4.9
59           6.6
60           5.2
61           5.0
62           5.9
63           6.0
64           6.1
65           5.6
66           6.7
67           5.6
68           5.8
69           6.2
70           5.6
71           5.9
72           6.1
73           6.3
74           6.1
75           6.4
76           6.6
77           6.8
78           6.7
79           6.0
80           5.7
81           5.5
82           5.5
83           5.8
84           6.0
85           5.4
86           6.0
87           6.7
88           6.3
89           5.6
90           5.5
91           5.5
92           6.1
93           5.8
94           5.0
95           5.6
96           5.7
97           5.7
98           6.2
99           5.1
100          5.7
101          6.3
102          5.8
103          7.1
104          6.3
105          6.5
106          7.6
107          4.9
108          7.3
109          6.7
110          7.2
111          6.5
112          6.4
113          6.8
114          5.7
115          5.8
116          6.4
117          6.5
118          7.7
119          7.7
120          6.0
121          6.9
122          5.6
123          7.7
124          6.3
125          6.7
126          7.2
127          6.2
128          6.1
129          6.4
130          7.2
131          7.4
132          7.9
133          6.4
134          6.3
135          6.1
136          7.7
137          6.3
138          6.4
139          6.0
140          6.9
141          6.7
142          6.9
143          5.8
144          6.8
145          6.7
146          6.7
147          6.3
148          6.5
149          6.2
150          5.9

A less common alternative is to attach only selected functions with library():

library("dplyr", include.only = c("select", "mutate"))
mutate(iris, sepal_length = Sepal.Length * 10) |> 
  select(sepal_length)

The Comprehensive R Archive Network (CRAN)

  • Central repository for R packages
  • Rigorous policies and testing
  • Currently more than 21k packages (July 2024)

Other sources?

  • Rigorous policies and testing are also a downside
    • Developers hesitate to submit packages
    • Unmaintained (but functional) packages are removed from CRAN
  • Alternative repositories are common:
    • GitHub and GitLab (and SVN)
    • Bioconductor, R-universe 🚀, R-Forge and Omegahat
remotes::install_github("JBGruber/paperboy")

# Install package from Bioconductor
BiocManager::install(c("GenomicRanges", "Organism.dplyr"))

# Install 'magick' from 'ropensci' universe
install.packages("magick", repos = "https://ropensci.r-universe.dev")

Help!

One of the most important commands in R: ?/help:

?install.packages # And
?remotes::install_github # OR
help("install_github", package = "remotes")

All help files in R follow the same structure and principle (although not all help files contain all elements):

  • Title
  • Description
  • Usage: very important, as it shows you the default values for all arguments (i.e., what is used if you do not set anything) and the assumed order
install_github("JBGruber/paperboy") # Same as
install_github(repo = "JBGruber/paperboy",  ref = "HEAD") # Same as
install_github(ref = "HEAD", repo = "JBGruber/paperboy") # Not(!) same as
install_github("HEAD", "JBGruber/paperboy")
  • Arguments: description of the arguments of a function. One special argument is ... (called ellipsis or dots), which is passed on to an underlying function.
install_github("JBGruber/paperboy", Ncpus = 6)
  • Details: Usually not that important but this is the first place to look when a function is not doing what you expect
  • Examples: where I usually start to learn a new function by looking at cases that certainly work (and then rewriting them for my purposes).
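The matching of arguments by name and by position can be tried out safely with a toy function. This is a sketch only: describe_install() is a made-up name that mimics the repo/ref arguments from the usage example above and does not install anything.

```r
# A toy function with the same two arguments as in the usage example
describe_install <- function(repo, ref = "HEAD") {
  paste0("installing ", repo, " at ", ref)
}

# The default for ref is used
describe_install("JBGruber/paperboy")
# Named arguments: the order does not matter
describe_install(ref = "HEAD", repo = "JBGruber/paperboy")
# Positional arguments: repo and ref end up swapped
describe_install("HEAD", "JBGruber/paperboy")
```

formals(describe_install) lists the same defaults that the Usage section of a help file would show.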

Help!

  • Google (“ggplot2 r remove legends”)
  • Some good resources for answers:
    • stackoverflow.com (if you want to ask a question instead see how to ask a good question and use a reproducible example)
    • R help list (stat.ethz.ch)
    • https://www.r-bloggers.com/ (collection of personal blog posts related to R – so quality varies)
  • ChatGPT
library(askgpt)
log_init()
mean[1:10]
askgpt("What is wrong with my last command?")

Functions

Functions are easy to define in R:

new_fun <- function(x = 1) {
  out <- c(
    sum(x),
    mean(x),
    median(x)
  )
  return(out)
}
new_fun()
[1] 1 1 1
vec <- c(1:10)
new_fun(x = vec)
[1] 55.0  5.5  5.5

Going through this bit by bit:

  • new_fun: The name of the new function (convention: use something descriptive; don’t use . or CamelCase but _ if you have multiple words)
  • <-: The assignment operator.
  • function(x): Define arguments and defaults here.
  • {}: Everything inside the curly brackets is the body of the function (code you are running when calling the function).
  • return(): All objects created inside the function are destroyed as soon as the function has finished running, except for what is put in return(). The return can also be implicit: without return(), the value of the last evaluated expression is returned.
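A short sketch of the implicit return and of the local scope of objects (mpg_range is a made-up function name; mtcars is a built-in example data set):

```r
# The value of the last evaluated expression is returned implicitly
mpg_range <- function(x) {
  lowest <- min(x)   # exists only while the function runs
  highest <- max(x)
  c(lowest, highest) # returned without an explicit return()
}

mpg_range(mtcars$mpg)
exists("lowest") # FALSE in a fresh session: the local object is gone
```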

Data

In R, data is stored in objects. We will learn about different ways to do so tomorrow!

Loops

For loops

Iterate over a vector:

x <- NULL
for (i in 1:10) {
  message(i)
  x <- c(x, i)
}
x
 [1]  1  2  3  4  5  6  7  8  9 10
  • for: This is how you start the loop
  • i: This is the variable which takes a different value in each iteration of the loop
  • in: separates the variable from the vector
  • 1:10: The vector over which to iterate
  • {}: The expression inside the curly brackets is evaluated once for each value in the vector; i takes a different value each run
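Growing a vector with c() inside a loop copies it in every iteration. That is fine for ten values but can become slow for long vectors. A common alternative, sketched here with made-up values, is to create the result vector once and fill it by position; seq_along() also behaves correctly when the input is empty:

```r
values <- c(2, 4, 6)
squares <- numeric(length(values)) # pre-allocate the result vector
for (i in seq_along(values)) {
  squares[i] <- values[i]^2        # fill position i
}
squares
```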

Apply loops

Apply function to each element of a vector/list:

foo <- function(i, silent = FALSE) {
  if (!silent) {
    message(i) 
  }
  return(i)
}
x <- lapply(1:10, foo)
unlist(x)
 [1]  1  2  3  4  5  6  7  8  9 10
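Further arguments of the applied function can be placed after the function name; lapply() hands them on to every call. The sketch below reuses foo() from above to silence the messages and also shows vapply(), a base R variant that checks the type and length of each result:

```r
foo <- function(i, silent = FALSE) {
  if (!silent) {
    message(i)
  }
  return(i)
}
# silent = TRUE is passed on to each call of foo()
x <- lapply(1:3, foo, silent = TRUE)
# vapply() additionally requires every result to be a single integer
y <- vapply(1:3, foo, FUN.VALUE = integer(1), silent = TRUE)
y
```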

purrr::map loops

Also apply a function to each element of a vector/list, but return a vector of a fixed type (map_int() returns an integer vector and fails if that is not possible):

foo <- function(i, silent = FALSE) {
  if (!silent) {
    message(i) 
  }
  return(i)
}
x <- purrr::map_int(1:10, foo)
x
 [1]  1  2  3  4  5  6  7  8  9 10
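What the fixed return type means in practice can be seen when the function returns something else. A small sketch, assuming purrr is installed (the error is caught with tryCatch() so the script keeps running):

```r
# map_int() promises an integer vector, so a function returning text fails
result <- tryCatch(
  purrr::map_int(1:3, function(i) "a"),
  error = function(e) "map_int() refused the character result"
)
result

# map_chr() is the matching variant for character output
purrr::map_chr(1:3, function(i) paste0("item_", i))
```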

if

if can be used to conditionally run code:

if (TRUE) {
  1 + 1
}
[1] 2
if (FALSE) {
  1 + 1
}

Any code that evaluates to a logical (TRUE/FALSE) can be used:

if (1 + 1 == 2) {
  "Hello!"
}
[1] "Hello!"

You can extend this with else, which is executed when the original condition is FALSE:

if (1 + 2 == 2) {
  "Hello!"
} else {
  "Bye"
}
[1] "Bye"
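if expects a single TRUE or FALSE. Several conditions can be chained with else if, and for a whole vector of conditions the vectorised ifelse() is the usual tool. A small sketch with made-up values (classify is a hypothetical function name):

```r
classify <- function(hp) {
  if (hp > 200) {
    "high"
  } else if (hp > 100) {
    "medium"
  } else {
    "low"
  }
}
classify(150)

# ifelse() checks every element of a vector at once
ifelse(c(90, 150, 250) > 100, "above 100", "100 or below")
```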

base R

When people refer to base R, they usually mean all functions that are available after starting R without attaching any packages with library(package).

df <- mtcars # using a built-in example data.frame
table(df$cyl)

 4  6  8 
11  7 14 
sum(df$cyl)
[1] 198
mean(df$cyl)
[1] 6.1875
dist(head(df)) # calculates Euclidean distance between cases
                    Mazda RX4 Mazda RX4 Wag  Datsun 710 Hornet 4 Drive
Mazda RX4 Wag       0.6153251                                         
Datsun 710         54.9086059    54.8915169                           
Hornet 4 Drive     98.1125212    98.0958939 150.9935191               
Hornet Sportabout 210.3374396   210.3358546 265.0831615    121.0297564
Valiant            65.4717710    65.4392224 117.7547018     33.5508692
                  Hornet Sportabout
Mazda RX4 Wag                      
Datsun 710                         
Hornet 4 Drive                     
Hornet Sportabout                  
Valiant                 152.1241352
tolower(row.names(df))
 [1] "mazda rx4"           "mazda rx4 wag"       "datsun 710"         
 [4] "hornet 4 drive"      "hornet sportabout"   "valiant"            
 [7] "duster 360"          "merc 240d"           "merc 230"           
[10] "merc 280"            "merc 280c"           "merc 450se"         
[13] "merc 450sl"          "merc 450slc"         "cadillac fleetwood" 
[16] "lincoln continental" "chrysler imperial"   "fiat 128"           
[19] "honda civic"         "toyota corolla"      "toyota corona"      
[22] "dodge challenger"    "amc javelin"         "camaro z28"         
[25] "pontiac firebird"    "fiat x1-9"           "porsche 914-2"      
[28] "lotus europa"        "ford pantera l"      "ferrari dino"       
[31] "maserati bora"       "volvo 142e"         

Especially for simple operations and statistics, base is still great.

model <- lm(hp ~ mpg, data = df) # simple linear regression
summary(model)

Call:
lm(formula = hp ~ mpg, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-59.26 -28.93 -13.45  25.65 143.36 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   324.08      27.43  11.813 8.25e-13 ***
mpg            -8.83       1.31  -6.742 1.79e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.95 on 30 degrees of freedom
Multiple R-squared:  0.6024,    Adjusted R-squared:  0.5892 
F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

base R

base also has a plotting system:

plot(df$mpg, df$hp, col = "blue", ylab = "horse power", xlab = "miles per gallon", main = "Simple linear regression")
abline(model, col = "red")
text(30, 300, "We can add some text", col = "red")

Tidyverse

What is it?

  • The official description: “The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures”.
  • The principle that gives the tidyverse its name is that of tidy data: “Each variable forms a column. Each observation forms a row.” (see tidyr vignette for more info)
  • Seems trivial at first but as a principle can be quite consequential (e.g., it means that most object types are ignored and data.frames are very dominant)
  • Some coding principles attached to it (e.g., the pipe, functions as verbs that build on each other)

The pipe

  • Formerly %>% (from the magrittr package), available natively as |> since R 4.1.0
  • Forwards the result of one function to another
  • Makes for much more readable code:
transform(aggregate(. ~ cyl, data = subset(mtcars, hp > 100), FUN = function(x) round(mean(x, 2))), kpl = mpg * 0.4251)
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765

You can make this more readable by creating intermediate objects:

data1 <- subset(mtcars, hp > 100) # take subset of original data
data2 <- aggregate(. ~ cyl, data = data1, FUN = function(x) round(mean(x, 2))) # aggregate by taking rounded mean
transform(data2, kpl = mpg * 0.4251) # convert miles per gallon to kilometer per liter
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765

Or you use the pipe:

subset(mtcars, hp > 100) |> 
  aggregate(. ~ cyl, data = _, FUN = function(x) round(mean(x, 2))) |> 
  transform(kpl = mpg * 0.4251)
  cyl mpg disp  hp drat wt qsec vs am gear carb     kpl
1   4  26  108 111    4  2   18  1  1    4    2 11.0526
2   6  20  168 110    4  3   18  1  0    4    4  8.5020
3   8  15  350 192    3  4   17  0  0    3    4  6.3765

tidyverse functions are written with pipes in mind and are named as verbs with the goal to tell you exactly what they do:

library(tidyverse)
mtcars |> 
  filter(hp > 100) |> 
  group_by(cyl) |> 
  summarise(across(.cols = everything(), .fns = function(x) x |> mean() |> round(2))) |> 
  mutate(kpl = mpg * 0.4251)
# A tibble: 3 × 12
    cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb   kpl
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     4  25.9  108.  111   3.94  2.15  17.8  1     1     4.5   2    11.0 
2     6  19.7  183.  122.  3.59  3.12  18.0  0.57  0.43  3.86  3.43  8.39
3     8  15.1  353.  209.  3.23  4     16.8  0     0.14  3.29  3.5   6.42
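The base R and tidyverse results above differ (for example, hp for six cylinders is 110 versus 122). The reason appears to be the placement of a bracket: in round(mean(x, 2)) the 2 is handed to mean() as its second argument, trim, and a trim of 0.5 or more makes mean() return the median; round() then rounds to whole numbers. A small sketch with made-up numbers shows the difference:

```r
x <- c(1, 2, 4, 10.555)

round(mean(x, 2)) # 2 is matched to `trim`: median of x, rounded to 0 digits
round(mean(x), 2) # mean of x, rounded to 2 digits
```

The tidyverse pipeline calls mean() and round(2) in separate steps, so it reports rounded means.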

Note: You can interject the View() command at any line in a complicated pipeline to see the intermediate result in a spreadsheet-style data viewer.

Special package ggplot2

  • Completely overhauls the plotting system in R
  • IMO: the best plotting system in any programming/data science language
  • Implements the “Grammar of Graphics”: a language for describing custom plots instead of relying on predefined plotting functions
  • The specific logic makes it harder to learn than other packages, but you can express essentially any plot in it (I highly recommend using “ggplot2: Elegant Graphics for Data Analysis” to learn the package instead of individual tutorials)
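To give a first impression of the layered logic, here is a sketch that roughly mirrors the base R regression plot from earlier. It assumes ggplot2 is installed; the colours and labels are simply carried over from the base example.

```r
library(ggplot2)

# Layers are combined with +: data and aesthetics first, then geoms and labels
p <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(colour = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "red") +
  labs(x = "miles per gallon", y = "horse power", title = "Simple linear regression")
p
```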

Exercises 2

  1. Run ggplot(data = mpg). What do you see and why?
  2. In the function pb_collect() from paperboy, what do the arguments ignore_fails and connections do?
  3. Write a function that takes a numeric vector of miles per gallon consumption data and transforms it to kilometer per liter. If anything other than a numeric vector is entered, the function should display an error (hint: see ?stop).
  4. In the code below, check the sizes of the intermediate objects with object.size().
file_link <- "https://raw.githubusercontent.com/shawn-y-sun/Customer_Analytics_Retail/main/purchase%20data.csv"
df <- read.csv(file_link)
filtered_df <- df[df$Age >= 50,]
aggregated_df <- aggregate(filtered_df$Quantity, by = list(filtered_df$Day), FUN = sum)
names(aggregated_df) <- c("day", "total_quantity")
aggregated_df[order(aggregated_df$total_quantity, decreasing = TRUE)[1:5],]
    day total_quantity
162 162             73
460 460             73
123 123             61
183 183             60
340 340             57
  5. How could the code above be improved if you only want the final result, want the code to be readable, and care about memory usage?

Literate Programming

Background

“The language in which we express our ideas has a strong influence on our thought processes.”

― Donald Ervin Knuth, Literate Programming

  • When analysing data in R, a cornerstone of a good workflow is documenting what you are doing.
  • The whole point of doing data analysis in a programming language rather than a point and click tool is reproducibility.
  • Yet if your code does not run after a while and you don’t understand what you were doing when writing the code, it’s as if you had done your whole analysis in Excel!

Advantages

This is where literate programming has a lot of advantages:

  1. Enhanced Documentation: Literate programming combines code and documentation in a single, integrated document. This approach encourages researchers to write clear and comprehensive explanations of their code, making it easier for others (and even themselves) to understand the working of the code, (research) design choices, and logic.
  2. Improved Readability: By structuring code and documentation in a literate programming style, the resulting code becomes more readable and coherent. The narrative flow helps readers follow the thought process and intentions of the programmer, leading to improved comprehension and maintainability.
  3. Modular and Reusable Code: Literate programming emphasizes the organization of code into coherent and reusable chunks, as writers come to think of them as similar to paragraphs in a text, where each chunk develops one specific idea.
  4. Collaboration and Communication: Literate programming enhances collaboration among developers by providing a common platform to discuss, share, and review code. The narrative style fosters effective communication, allowing team members to understand the codebase more easily and collaborate more efficiently.
  5. Extensibility and Maintenance: Well-documented literate programs are typically easier to extend and maintain over time. The clear explanation of choices and functionality helps yourself and others in the future to make decisions about modifications, enhancements, and bug fixes.
  6. Reproducibility and accountability: When you save the rendered output of an analysis, you know exactly how a table or plot was created. If there are several versions, you can always turn to the rendered document and check which data, code and package versions were used to do your analysis (at least when documents were written in a specific way).

Quarto (and its predecessor R Markdown) were designed to make it easy for you to make the most of these advantages. We have already been using these tools throughout the workshop and I hope this made you more familiar with them.

Exercises 4

  1. Use the function report_template() from my package jbgtemplates to start a new report
  2. Add some simple analysis in it and render
  3. Play around with the formats and produce at least a PDF and Word output of your document
  4. Think about how the structure of the document enhances reproducibility

Some other tricks

The worst default setting in RStudio

The default setting to ask whether to save the current session is horrible. Eventually you will click “yes”, which saves your workspace to a file called .RData. This file is loaded whenever you open RStudio! This makes RStudio slow and can lead to unexpected behaviour, even when you delete all objects in your environment with rm(list = ls()).

Exercises

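  1. Change that setting NOW (in RStudio under Tools > Global Options > General, set “Save workspace to .RData on exit” to “Never”) and look for the .RData file in your project and home directory.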
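If you would rather search from the R console than from a file browser, a minimal sketch could look like this (it only checks the current working directory and the home directory; the object names are illustrative):

```r
# Candidate locations: the current project folder and the home directory
rdata_files <- c(file.path(getwd(), ".RData"), path.expand("~/.RData"))
found <- rdata_files[file.exists(rdata_files)]
found # empty if no saved workspace is lying around
# file.remove(found) # uncomment to delete the files that were found
```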

Git some Version Control

Git is an extensive application and there is too much to cover here. But you do not need all the functionality to make efficient use of its main promise: keeping track of what is happening in your projects and giving you the ability to revert to an older state of your project.

This screenshot shows the last months of my PhD, when I was furiously working on integrating comments from different people. Not only did git show me my progress nicely, I also did not have to worry about accidentally deleting anything that might still prove valuable. Especially towards the end, I often removed sections or copied them to other chapters. Whenever I could not find a specific section, I went back to the last commit when I could still remember where it was.

Additionally, GitHub offers some nice features to organise and plan projects around issues.

Here you can note down remaining problems and keep track of your progress. It keeps your head free for other things!

Learn more, e.g., at: https://happygitwithr.com/

Homework

Many of you did not come to class to just scrape exercise pages. You probably had some initial data and/or research question in mind. Please write a short abstract on what you want to accomplish with the web scraping and data management skills you will learn here. The abstract should include:

  • general goal
  • research question
  • (preliminary) assessment of what data you need, what data can be found on the website, and what potential research questions you have in mind.

Deadline: Tuesday midnight

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rvest_1.0.4       tinytable_0.3.0   reticulate_1.37.0 lubridate_1.9.3  
 [5] forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2      
 [9] readr_2.1.5       tidyr_1.3.1       tibble_3.2.1      ggplot2_3.5.1    
[13] tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    xml2_1.3.6        stringi_1.8.4    
 [5] lattice_0.22-6    hms_1.1.3         digest_0.6.35     magrittr_2.0.3   
 [9] evaluate_0.23     grid_4.4.1        timechange_0.3.0  fastmap_1.2.0    
[13] jsonlite_1.8.8    Matrix_1.7-0      processx_3.8.4    chromote_0.2.0   
[17] ps_1.7.6          promises_1.3.0    httr_1.4.7        fansi_1.0.6      
[21] scales_1.3.0      codetools_0.2-20  cli_3.6.2         rlang_1.1.3      
[25] munsell_0.5.1     withr_3.0.0       yaml_2.3.8        tools_4.4.1      
[29] tzdb_0.4.0        colorspace_2.1-0  curl_5.2.1        vctrs_0.6.5      
[33] R6_2.5.1          png_0.1-8         lifecycle_1.0.4   pkgconfig_2.0.3  
[37] later_1.3.2       pillar_1.9.0      gtable_0.3.5      glue_1.7.0       
[41] Rcpp_1.0.12       xfun_0.44         tidyselect_1.2.1  rstudioapi_0.16.0
[45] knitr_1.46        farver_2.1.2      websocket_1.4.1   htmltools_0.5.8.1
[49] labeling_0.4.3    rmarkdown_2.27    compiler_4.4.1
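To keep this information with an analysis, the printed output can be written to a text file. A minimal sketch: the file name session_info.txt is just an example, and a temporary folder is used here so that nothing in your project is overwritten.

```r
# Capture the printed session information and write it to a text file
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)
readLines(info_file, n = 1) # the first line usually names the R version
```

In a real project you would point info_file to a path inside the project folder instead.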